1  Technical Objectives

The overarching goal of this project is the creation of the leading computational chemistry
workbench, making the premier computational chemistry codes and databases easily accessible
to chemistry practitioners. This has been accomplished by creating an open, extensible
application framework that puts computational tools, data, and domain-specific knowledge
at the fingertips of chemists. A data-centric approach to chemistry, storing data in a
searchable database, empowers users to efficiently collaborate, innovate, and push the
frontiers of research forward.

As the power of our computational resources grows, computational chemists face a growing
discrepancy between our ability to run calculations/simulations and our ability to
meaningfully store, search, retrieve and analyze data. As the sophistication of the
computational codes grows and access to powerful computational resources becomes more
commonplace, there is an increasingly steep learning curve to effectively using new
computational tools and analyzing their output. Our objective is to make the lives of
computational chemists easier by making these tools accessible to a wider range of chemists.
The specific goals of the Phase II project are outlined below, and summarized in Figure 1.


•  Develop an extensible, plugin-based, flexible chemistry application and library

•  Develop an application for easily using HPC resources from desktop applications

•  Develop a specialized desktop database application for chemical information

•  Develop chemistry-specific analysis and visualization techniques

•  Develop a specialized file format capable of storing large data

•  Develop a library for ingesting and calculating electronic structure

•  Develop a “chemistry workbench” offering state-of-the-art tools to the community


Kitware has been very successful for more than a decade by building collaborative
innovation platforms that allow us to work with the best research groups in the world to
leverage their research and development. This framework positions us to be able to pursue
fruitful collaborations in chemistry and several other related areas.


2  Work  Summary 

This section summarizes the work done during the two-year Phase II SBIR project, with
discussion of progress made in achieving the overall goals of the project, as outlined in
the proposal referenced. This project involves three major areas of development (shown in
Figure 1), with several open source projects supporting the work (shown in Figure 2).

The projects shown in Figure 2 summarize those being developed or extended as part of
this project, with an indication of application domain and type. There are three user-facing
graphical applications that are aimed at being used from the desktop to do research in the
broad area of computational chemistry: Avogadro 2, MongoChem, and MoleQueue. Each of these
applications is specialized to deal with distinct domains, but designed to be used in
unison with the other applications.

Figure 1: Open Chemistry workflow, with Open Chemistry applications filling the roles
in green, and the flow of data indicated by arrows.

Figure 2: Open Chemistry projects grouped by basic dependency and application area.

These projects build upon existing libraries where possible. They are written in C++ and
make use of several cross-platform, open source libraries and tools, such as Qt and CMake,
in order to build on many platforms. In addition to the more generic libraries and tools,
several specialized libraries are also being developed or extended in order to support the
Open Chemistry project. The VTK project is one of the oldest C++ visualization libraries
still actively developed, and is one of Kitware’s core projects. It has been augmented with
additional chemical data structures and visualization types that complement the existing
visualization approaches in order to make it a more compelling choice for chemists. In
terms of where it can be deployed and used, it sits between the GUI/visualization libraries
and the core/command line layer, offering a data pipeline, CPU or GPU rendering, and
parallel techniques for data analysis and reduction.

The AvogadroLibs components provide the majority of the chemistry-specific functionality
necessary, such as standard descriptors, file formats, force fields, data structures,
algorithms, and post-processing calculations on computational chemistry output files. In
addition to new functionality developed in these libraries, the Open Babel and RDKit
libraries can also be used, for example in the generation of 2D chemical structure
depictions in batch mode and for file format support/conversion.

2.1  Software  Process  and  Project  Dissemination 

The Open Chemistry applications and libraries are developed as independent projects
grouped under the Open Chemistry project, with a community site at openchemistry.org.
Many open source projects use a somewhat standard software process, which has been
adopted in a slightly modified form for the Open Chemistry projects, as shown in Figure 3.

Several  key  resources  have  been  put  in  place  for  the  projects: 

•  Community  website  dedicated  to  Open  Chemistry  projects 

•  Git  source  code  repositories  (Kitware,  mirrored  to  Github  and  Gitorious) 

•  Online  code  review  tool  (Gerrit) 

•  Online  software  quality  dashboards  (CDash) 

•  Community  wiki  pages  (MediaWiki) 

•  Bug  tracking  and  project  management  tools 

•  Mailing  lists 

The community website (Figure 4) acts as an entry point to the project, and gives a brief
introduction to the projects with links to specific resources. The projects use permissive,
non-reciprocal BSD licenses, and distributed version control (Git) in order to enable
customization of private branches and shared open branches. Git also offers the possibility
of mirroring in multiple locations, with full access to the history and private mirrors
possible within organizational units. This has been successfully used with customers
requiring customization that must remain private, while minimizing the
maintenance/integration burden associated with traditional centralized version control
systems.

Figure 3: The software process used for Open Chemistry projects.

The Gerrit code review system, developed by Google as an open source project for the
Android operating system, enables online review of code submissions from anyone while
retaining control of what code is accepted into the code base. This has been combined with
nightly software build testing on all three major platforms for merged code, and testing of
proposed changes using CDash@Home (an open source project developed at Kitware to
address the need for testing arbitrary branches automatically). This level of automation
gives Open Chemistry projects the ability to maintain high code quality: reviewers are
free to focus on verifying code correctness while the automated systems ensure that the
code is portable and works on all major platforms. This process is summarized in Figure 3.

The projects are in independent code repositories, which are then included in an Open
Chemistry repository that is capable of building all of the projects, along with their major
dependencies, on the mainstream platforms (Linux, Windows and Mac). The Open Chemistry
repository provides an easy entry point for new developers, and using CDash with
nightly testing provides binaries for Windows and Mac that are automatically generated
every night. These are available both on the dashboard and on a site set up to help users
find the appropriate binaries.
Figure 4: The Open Chemistry homepage at openchemistry.org


2.2  Software  Repositories  and  Statistics 

Figure 2 shows the high-level overview of the projects developed as part of this SBIR
project, along with VTK, which was extended. These artifacts can be accessed through the
openchemistry.org website, shown in Figure 4, and are available for inspection by all under
permissive open source licenses. The Avogadro 1.x and VTK projects remain in software
repositories that were established before this project began, and both projects were
enhanced as part of this work. There are also minor contributions to various projects this
work depends upon.

The vast majority of the development was focused on the Open Chemistry projects, which
were placed in new software repositories. Some of the development statistics are summarized
below for the Open Chemistry umbrella project that contains all other projects; individual
statistics could also be extracted if desired. The Open Chemistry projects have:

•  2,678  commits  (first  recorded  in  March,  2011) 

•  15  distinct  contributors  to  the  code 

•  118,864  lines  of  code 


Distribution  Statement  A:  Distribution  is  approved  for  public  release;  distribution  is  unlimited. 


Topic  AlO-llO 


Proposal  A2-4714 


Kitware,  Inc. 


It is interesting to look at the Avogadro 2 project, which is the largest part of the Open
Chemistry project (nearly 65% by lines of code developed). This is due to the Avogadro
Libraries repository, which contains a lot of common code reused in the other applications.
Taking the libraries, application, and data repositories into account, the Avogadro 2
project has:

•  1,115 commits (first recorded in October, 2011)

•  13  distinct  contributors  to  the  code 

•  76,687  lines  of  code 

These projects will be described in more detail in the following sections; the above
statistics are intended only to provide some high-level numbers on the scale of the code
developed and the number of developers contributing code. In addition to these numbers, the
projects are leveraging code from major libraries such as VTK, which has in excess of 1.47
million lines of code from over 200 contributors, with its first commit recorded in January
of 1994.

2.3  Data  Models  and  Communication  Strategies 

Early in the development of the project, the JavaScript Object Notation (JSON) format was
settled upon for simple serialization/deserialization of data and for inter-process
communication. JSON is a simple industry standard that is increasingly used in places where
formats such as XML were once used. It has the distinct advantage of being a very simple
format that can be parsed easily, and it has support in a diverse array of programming
languages, from compiled languages such as C, C++ and Fortran through to Java, Python,
Perl, and JavaScript.

At its core, the JSON data structure consists of key/value pairs, objects, and arrays.
These concepts are universal to most programming languages, enabling a great deal of
freedom in language choice for data exchange. There are several C++ JSON libraries
available, and two were chosen for use in the Open Chemistry project: JsonCpp, which is a
very small MIT licensed library using only the STL, and Qt 5’s JSON classes, which were
backported to Qt 4.8 to enable their use in both Qt 4.8 and Qt 5 based projects. The Python
language has native support for JSON data structures using dictionaries, and JavaScript
support is strong.

The MongoDB project was chosen as a scalable NoSQL data store for the cheminformatics
components of this work. MongoDB uses a binary form of JSON called BSON, which follows
many of the same principles as JSON, but stores data in raw binary and has optimized the
data structures to support fast reading and writing of documents. This means that moving
data from the backend data store to applications and over inter-process communication
channels is a simple process. Libraries with native BSON support are also available for
several languages, such as C, C++, and Python. BSON has the distinct advantages of I/O
speed and the ability to store raw binary data, such as PNG images and binary file
fragments, with the binary data length encoded in the standard representation, along with
support for most basic types.
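
The benefit of length-prefixed binary fields can be illustrated with a small C++ sketch. This is a deliberately simplified record layout invented for illustration, not the actual BSON wire format, and the function names (appendField, findField) are hypothetical. Each field is written as a one-byte key length, the key, a four-byte payload length in host byte order, and then the payload, so a reader looking for one key can step over every other payload without decoding it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Simplified, hypothetical layout (NOT real BSON):
// [1-byte key length][key bytes][4-byte payload length][payload bytes]
void appendField(std::vector<std::uint8_t>& buf, const std::string& key,
                 const void* data, std::uint32_t size)
{
  assert(key.size() <= 255); // key length must fit in one byte
  buf.push_back(static_cast<std::uint8_t>(key.size()));
  buf.insert(buf.end(), key.begin(), key.end());

  std::uint8_t len[4];
  std::memcpy(len, &size, sizeof(len)); // host byte order, for brevity
  buf.insert(buf.end(), len, len + 4);

  const std::uint8_t* bytes = static_cast<const std::uint8_t*>(data);
  buf.insert(buf.end(), bytes, bytes + size);
}

// Locate the payload for `key`. Payloads of other fields are skipped using
// their stored length and are never inspected.
bool findField(const std::vector<std::uint8_t>& buf, const std::string& key,
               std::size_t& offset, std::uint32_t& size)
{
  std::size_t pos = 0;
  while (pos < buf.size()) {
    const std::size_t keyLen = buf[pos++];
    if (pos + keyLen + 4 > buf.size())
      return false; // truncated header
    const std::string name(
      reinterpret_cast<const char*>(buf.data() + pos), keyLen);
    pos += keyLen;

    std::uint32_t len = 0;
    std::memcpy(&len, buf.data() + pos, sizeof(len));
    pos += 4;
    if (pos + len > buf.size())
      return false; // truncated payload

    if (name == key) {
      offset = pos;
      size = len;
      return true;
    }
    pos += len; // jump straight over the payload
  }
  return false;
}
```

With this layout, a reader interested only in a small property field can pass over a large coordinate block in a single pointer increment, which is the behaviour the type and length prefixes in BSON are intended to enable.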

The use of JSON and BSON in these projects also prompted the development of a
JSON/BSON data model to represent chemical structures. This was developed to mirror
many of the structures already developed for the Chemical Markup Language (CML), and
translation between the two formats should be lossless. They are both extensible formats
building on widely accepted industry standard data exchange formats. Neither is especially
suited to very large data, but each can be coupled with a separate binary data store to give
semantically rich data documents that point to larger blobs of binary data when
appropriate. An example of a small structure in Chemical JSON:


    {
      "chemical json": 0,
      "name": "Ethane",
      "inchi": "1/C2H6/c1-2/h1-2H3",
      "formula": "C 2 H 6",
      "atoms": {
        "elements": {
          "number": [ 1, 6, 1, 1, 6, 1, 1, 1 ]
        },
        "coords": {
          "3d": [  1.185080, -0.003838,  0.987524,
                   0.751621, -0.022441, -0.020839,
                   1.166929,  0.833015, -0.569312,
                   1.115519, -0.932892, -0.514525,
                  -0.751587,  0.022496,  0.020891,
                  -1.166882, -0.833372,  0.568699,
                  -1.115691,  0.932608,  0.515082,
                  -1.184988,  0.004424, -0.987522 ]
        }
      },
      "bonds": {
        "connections": {
          "index": [ 0, 1,
                     1, 2,
                     1, 3,
                     1, 4,
                     4, 5,
                     4, 6,
                     4, 7 ]
        },
        "order": [ 1, 1, 1, 1, 1, 1, 1 ]
      },
      "properties": {
        "molecular mass": 30.0690,
        "melting point": -172,
        "boiling point": -88
      }
    }
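
As an illustration of how the flat arrays in the Chemical JSON layout might be consumed once a document has been parsed, the following C++ sketch stores them in plain vectors and recovers per-atom and per-bond values by index arithmetic. The type and function names (FlatMolecule, isConsistent, position, bondAtoms) are hypothetical and are not part of AvogadroLibs; JSON parsing itself is assumed to have happened elsewhere.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical in-memory mirror of the "atoms" and "bonds" objects.
struct FlatMolecule {
  std::vector<unsigned char> atomicNumbers; // atoms.elements.number
  std::vector<double> coords3d;             // atoms.coords.3d: x0,y0,z0,x1,...
  std::vector<std::size_t> bondIndices;     // bonds.connections.index: a0,b0,a1,b1,...
  std::vector<unsigned char> bondOrders;    // bonds.order
};

// Check that the array lengths agree with each other and that every bond
// refers to an existing atom.
bool isConsistent(const FlatMolecule& m)
{
  const std::size_t n = m.atomicNumbers.size();
  if (m.coords3d.size() != 3 * n)
    return false;
  if (m.bondIndices.size() != 2 * m.bondOrders.size())
    return false;
  for (std::size_t idx : m.bondIndices)
    if (idx >= n)
      return false;
  return true;
}

// Coordinates of atom i occupy elements 3i, 3i+1, 3i+2 of the flat array.
std::array<double, 3> position(const FlatMolecule& m, std::size_t atom)
{
  assert(3 * atom + 2 < m.coords3d.size());
  return {{ m.coords3d[3 * atom], m.coords3d[3 * atom + 1],
            m.coords3d[3 * atom + 2] }};
}

// The two atoms of bond j occupy elements 2j and 2j+1 of the index array.
std::pair<std::size_t, std::size_t> bondAtoms(const FlatMolecule& m,
                                              std::size_t bond)
{
  assert(2 * bond + 1 < m.bondIndices.size());
  return std::make_pair(m.bondIndices[2 * bond], m.bondIndices[2 * bond + 1]);
}
```

For the ethane example, atom index 1 is the first carbon and bond index 3 joins atoms 1 and 4, the two carbons. Keeping the arrays flat means a reader that does not need coordinates can ignore the whole "3d" array as a single unit.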


The key names map well to the XML nodes in CML documents, and this structure
can easily be stored directly in MongoDB as a document (or object within a document) or
passed between processes as JSON. Readers can check for the existence of known keys, and
JSON documents can be built up by various subroutines using a simple in-memory model.
This format maps well to BSON documents, where each key/value has a type and length
field before the actual data. Using arrays of floats, doubles, integers, etc. maximizes
storage efficiency and binary read speed, and makes it possible to skip the 3D coordinates
efficiently when, for example, only the properties are of interest. The equivalent CML is
shown below to aid comparison.
    <?xml version="1.0" encoding="UTF-8"?>
    <molecule xmlns="http://www.xml-cml.org/schema"
              xmlns:cml="http://www.xml-cml.org/dict/cml"
              xmlns:units="http://www.xml-cml.org/units/units"
              xmlns:xsd="http://www.w3c.org/2001/XMLSchema"
              xmlns:iupac="http://www.iupac.org"
              id="CS_ethane">
      <formula concise="C 2 H 6"/>
      <identifier convention="iupac:inchi" value="1/C2H6/c1-2/h1-2H3"/>
      <name convention="IUPAC">Ethane</name>
      <atomArray>
        <atom id="a1" elementType="H" x3="1.185080" y3="-0.003838" z3="0.987524"/>
        <atom id="a2" elementType="C" x3="0.751621" y3="-0.022441" z3="-0.020839"/>
        <atom id="a3" elementType="H" x3="1.166929" y3="0.833015" z3="-0.569312"/>
        <atom id="a4" elementType="H" x3="1.115519" y3="-0.932892" z3="-0.514525"/>
        <atom id="a5" elementType="C" x3="-0.751587" y3="0.022496" z3="0.020891"/>
        <atom id="a6" elementType="H" x3="-1.166882" y3="-0.833372" z3="0.568699"/>
        <atom id="a7" elementType="H" x3="-1.115691" y3="0.932608" z3="0.515082"/>
        <atom id="a8" elementType="H" x3="-1.184988" y3="0.004424" z3="-0.987522"/>
      </atomArray>
      <bondArray>
        <bond atomRefs2="a1 a2" order="1"/>
        <bond atomRefs2="a2 a3" order="1"/>
        <bond atomRefs2="a2 a4" order="1"/>
        <bond atomRefs2="a2 a5" order="1"/>
        <bond atomRefs2="a5 a6" order="1"/>
        <bond atomRefs2="a5 a7" order="1"/>
        <bond atomRefs2="a5 a8" order="1"/>
      </bondArray>
      <propertyList>
        <property dictRef="cml:molwt" title="Molecular weight">
          <scalar dataType="xsd:double" dictRef="cml:molwt" units="units:g">30.0690</scalar>
        </property>
        <property dictRef="cml:monoisotopicwt" title="Monoisotopic weight">
          <scalar dataType="xsd:double" dictRef="cml:monoisotopicwt" units="units:g">30.0469502</scalar>
        </property>
        <property dictRef="cml:mp" title="Melting point">
          <scalar dataType="xsd:double" errorValue="1.0" dictRef="cml:mp" units="units:celsius">-172</scalar>
        </property>
        <property dictRef="cml:bp" title="Boiling point">
          <scalar dataType="xsd:double" errorValue="1.0" dictRef="cml:bp" units="units:celsius">-88</scalar>
        </property>
      </propertyList>
    </molecule>


2.4  Avogadro  1.x 

The Avogadro 1.x project has been developed as a library and desktop application since its
inception. A paper was published in 2012 discussing the work that went into the project
leading up to the 1.0 release series, providing a summary of the development effort. As
of this writing the paper has received 184 citations according to Google Scholar, and is the
third most viewed paper of all time in the Journal of Cheminformatics; Figure 5 shows
the graphical abstract. After the paper was published the Avogadro 1.1.0 release was made,
which incorporated some of the early improvements developed during this SBIR project
(largely in Phase I, and early in Phase II). Development in Phase II of the project was split
between improvements to and stabilization of features in Avogadro 1.1, and rewriting the
core data structures and algorithms for Avogadro 2, with the majority of effort redirected
to the latter. Avogadro 2 moves from a GPLv2-only license to a more permissive 3-clause BSD
license that allows for much wider use in all sections of government, academia, industry,
and education.


Figure 5: The graphical abstract for the Avogadro paper.


The Avogadro application is a user-facing application capable of loading, editing and
saving chemical structures and loading/analyzing output from many popular computational
chemistry codes. It is developed as an open source community project, with input from across
the industry in both research and education. In the latest release of the Avogadro project
(1.1.0) significant new features were added, such as support for directly reading GAMESS-US
log files (among other new codes), automated calculation of all molecular orbital
intensities in the background to enable the rapid comparison of orbital shapes and sizes,
and improved support for vibrational mode animations. A new crystallography extension was
added, providing significantly improved support for periodic structures. A crystal library
was added to the distribution, along with new builders such as a nanotube builder and
chirality inversion.
2.5  Avogadro 2 Libraries

The Avogadro 1.1 branch continues to be developed, but development efforts moved to the
Avogadro 2 rewrite (along with porting code). All major contributors agreed to relicense
their contributions under the 3-clause BSD license, with a new set of libraries developed
to serve the next generation of chemical manipulation and visualization tools. In order to
serve everything from desktop applications to HPC client-server deployments, the libraries
have been developed using a much more modular pattern. In Avogadro 1 there was a single
Avogadro library, and an application that linked to it. Avogadro 2 features a number of
libraries for specific application areas. This is needed because heavy computation needs to
take place on HPC systems with no graphical environment, whereas the desktop applications
need a range of custom rendering and graphical widgets in order to be user friendly and
easy to use.


Figure  6:  The  organization  of  the  Avogadro  2  libraries  (blue  outlines),  and  some  of  their 
dependencies. 

Avogadro 2 is a complete rewrite of the libraries and applications from the ground up.
All of the core data structures, APIs, rendering algorithms, file handling, plugin
architecture and interaction were rethought and written for both scalability and streaming.
This is most apparent when comparing large systems in Avogadro 1.x and Avogadro 2:
previously, systems exceeding a few thousand atoms in size were difficult to load and/or
interact with, and the problem became very apparent above ten thousand atoms. Avogadro 2
has been demonstrated loading and interacting with systems containing in excess of 2.8
million atoms, with frame rates remaining interactive and load times reasonable. These
represent some of the largest systems most groups are interested in simulating using
molecular dynamics at the current time, and it is clear that it would be possible to look at
even larger systems.
2.5.1  Core and IO Libraries

The load and interaction improvements have been achieved by building a core library that
has very few dependencies, avoiding any graphical operations. Data structures, core
algorithms, and common definitions are implemented here. A small, focused input/output
library extends the core library and adds basic file IO, with new dependencies on small
JSON and XML parsing libraries, and Boost in order to efficiently implement core file
format support. The HDF5 library is needed by the IO library; most other file formats will
be translated to these core formats using tools such as Open Babel, Chemkit, and RDKit. The
file IO layer is extensible, and manages registered format handlers implementing some
project-specific formats such as Chemical JSON described earlier, as well as common
chemical file formats such as CML, XYZ, MDL, and the GROMACS .gro format. A snippet from
the XYZ format shows how simple the read function is, where the format developer implements
a simple method taking the input stream and a molecule object as parameters, and populates
the molecule object.

    bool XyzFormat::read(std::istream &inStream, Core::Molecule &mol)
    {
      size_t numAtoms = 0;
      if (!(inStream >> numAtoms)) {
        appendError("Error parsing number of atoms.");
        return false;
      }

      std::string buffer;
      std::getline(inStream, buffer); // Finish the first line
      std::getline(inStream, buffer);
      if (!buffer.empty())
        mol.setData("name", trimmed(buffer));

      // Parse atoms
      unsigned char atomicNum;
      Vector3 pos;
      for (size_t i = 0; i < numAtoms; ++i) {
        if (inStream >> buffer &&
            inStream >> pos.x() &&
            inStream >> pos.y() &&
            inStream >> pos.z()) {
          if (!buffer.empty()) {
            if (isalpha(buffer[0])) {
              atomicNum = Elements::atomicNumberFromSymbol(buffer);
            }
            else {
              short int atomicNumInt = 0;
              std::istringstream(buffer) >> atomicNumInt;
              atomicNum = static_cast<unsigned char>(atomicNumInt);
            }
            Atom newAtom = mol.addAtom(atomicNum);
            newAtom.setPosition3d(pos);
            continue;
          }
        }
        break;
      }

      // Check that all atoms were handled.
      if (mol.atomCount() != numAtoms) {
        std::ostringstream errorStream;
        errorStream << "Error parsing atom at index " << mol.atomCount()
                    << " (line " << 3 + mol.atomCount() << ").";
        appendError(errorStream.str());
        return false;
      }

      return true;
    }
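
The idea of an extensible IO layer that manages registered format handlers can be sketched in a few lines of C++. The sketch below is not the AvogadroLibs API: the class names (SimpleMolecule, FileFormat, AtomicNumberListFormat, FormatRegistry) and the toy format are assumptions made purely for illustration, using only the standard library.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <memory>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Stand-in for a molecule class; it only stores atomic numbers.
struct SimpleMolecule {
  std::vector<unsigned char> atomicNumbers;
};

// Interface each format handler implements: populate mol from the stream.
class FileFormat {
public:
  virtual ~FileFormat() = default;
  virtual bool read(std::istream& in, SimpleMolecule& mol) = 0;
};

// Toy format (hypothetical): whitespace-separated atomic numbers.
class AtomicNumberListFormat : public FileFormat {
public:
  bool read(std::istream& in, SimpleMolecule& mol) override
  {
    int z = 0;
    while (in >> z) {
      if (z < 1 || z > 118)
        return false;
      mol.atomicNumbers.push_back(static_cast<unsigned char>(z));
    }
    return in.eof(); // fail if parsing stopped before the end of input
  }
};

// Registry mapping a file extension to the handler that reads it.
class FormatRegistry {
public:
  bool registerFormat(const std::string& extension,
                      std::unique_ptr<FileFormat> format)
  {
    if (!format || m_formats.count(extension) != 0)
      return false; // refuse null handlers and duplicate extensions
    m_formats[extension] = std::move(format);
    return true;
  }

  bool read(const std::string& extension, std::istream& in,
            SimpleMolecule& mol) const
  {
    auto it = m_formats.find(extension);
    return it != m_formats.end() && it->second->read(in, mol);
  }

private:
  std::map<std::string, std::unique_ptr<FileFormat>> m_formats;
};
```

A caller registers one handler per extension and then dispatches on the extension of the file being opened. New formats can be added without touching the registry, which is the property that makes this kind of layer extensible.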

A fast and lightweight CML reader and writer has been developed using the PugiXML
library for XML parsing, and the HDF5 library for storing large amounts of data in a binary
format. Significant improvements in load time and memory utilization have been achieved
over previous implementations, along with simpler code that can be more easily extended
in the future as more features are required. The CML and HDF5 reader/writer has been
developed as part of this project, and discussed with experts in the field as an approach
to creating standards-compliant and scalable formats for use in chemistry. Converters from
other formats are also being developed using two approaches: the extension of Open Babel,
which already supports a vast array of formats, and the development of JUMBO converters
directly to CML in collaboration with others in the field.

2.5.2  Molecule  Classes 

Previously the molecule class of Avogadro was deeply entwined with that of Open Babel,
and one class was used for everything. In Avogadro 2, after a great deal of thought, three
molecule classes were created. The core Avogadro library houses the first, the
Core::Molecule class. A flyweight proxy pattern was employed for the molecule’s atoms
and bonds, in stark contrast to previously used approaches. Atoms and bonds are temporary
objects, containing only a pointer to their parent molecule and their index, while still
offering all of the expected API. The molecule has a series of data arrays that contain
each atom’s atomic number, position, etc. Properties that are not used remain as
zero-length arrays in the molecule, and are only allocated when values are set.
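
A minimal C++ sketch of the flyweight proxy idea is given here for illustration. It is not the Avogadro implementation: the class names SketchMolecule and SketchAtom are hypothetical, positions use std::array rather than a vector math type, and only two properties are modelled. The atom object holds just a parent pointer and an index, and the position array stays empty until a position is first set.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

class SketchMolecule;

// Proxy object: holds no atom data of its own, only where to find it.
class SketchAtom {
public:
  SketchAtom(SketchMolecule* molecule, std::size_t index)
    : m_molecule(molecule), m_index(index) {}

  std::size_t index() const { return m_index; }
  unsigned char atomicNumber() const;
  void setPosition3d(const std::array<double, 3>& pos);
  std::array<double, 3> position3d() const;

private:
  SketchMolecule* m_molecule;
  std::size_t m_index;
};

// The molecule owns one linear array per property.
class SketchMolecule {
public:
  SketchAtom addAtom(unsigned char atomicNumber)
  {
    m_atomicNumbers.push_back(atomicNumber);
    // Only grow the position array if positions are already in use.
    if (!m_positions.empty())
      m_positions.resize(m_atomicNumbers.size());
    return SketchAtom(this, m_atomicNumbers.size() - 1);
  }

  SketchAtom atom(std::size_t i)
  {
    assert(i < atomCount());
    return SketchAtom(this, i);
  }

  std::size_t atomCount() const { return m_atomicNumbers.size(); }
  bool hasPositions() const { return !m_positions.empty(); }

private:
  friend class SketchAtom;
  std::vector<unsigned char> m_atomicNumbers;
  std::vector<std::array<double, 3>> m_positions; // zero-length until set
};

inline unsigned char SketchAtom::atomicNumber() const
{
  return m_molecule->m_atomicNumbers[m_index];
}

inline void SketchAtom::setPosition3d(const std::array<double, 3>& pos)
{
  std::vector<std::array<double, 3>>& positions = m_molecule->m_positions;
  if (positions.size() < m_molecule->atomCount())
    positions.resize(m_molecule->atomCount()); // new entries are zeroed
  positions[m_index] = pos;
}

inline std::array<double, 3> SketchAtom::position3d() const
{
  const std::vector<std::array<double, 3>>& positions = m_molecule->m_positions;
  if (m_index < positions.size())
    return positions[m_index];
  return std::array<double, 3>{{ 0.0, 0.0, 0.0 }};
}
```

Because the proxy is only a pointer and an index, it can be created and discarded freely, and two proxies for the same index always see the same underlying data.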

The molecules leverage a copy-on-write pattern for heavy data storage, whereby copying a
molecule copies each of its arrays into the new molecule. Copying an array only increments
the reference count of the inner data; the underlying memory is not copied unless a
non-const method is called. In this way very cheap copies of large molecules can
be created, and temporary molecules are able to effectively pass along their underlying data
without creating multiple, redundant, in-memory copies of heavy data. This also makes
file input/output much more efficient: iterating through all stored position coordinates is
a simple traversal of a linear array, and allocating these buffers for a known input can
involve just one initial memory allocation, reducing memory fragmentation. The code listing
below demonstrates how simple it is to create a molecule, and work with the proxy atom/bond
objects in the molecule classes.




    Avogadro::Core::Molecule mol;
    Avogadro::Core::Atom o1 = mol.addAtom(8);
    Avogadro::Core::Atom h2 = mol.addAtom(1);
    Avogadro::Core::Atom h3 = mol.addAtom(1);
    o1.setPosition3d(Vector3(0, 0, 0));
    h2.setPosition3d(Vector3(0.6, -0.5, 0));
    h3.setPosition3d(Vector3(-0.6, -0.5, 0));
    Avogadro::Core::Bond b1 = mol.addBond(o1, h2, 1);
    Avogadro::Core::Bond b2 = mol.addBond(o1, h3, 1);

The QtGui::Molecule class inherits from Core::Molecule and Qt's QObject so that it can
participate in the Qt framework. It adds signals and slots, as well as the simple parenting
offered by the Qt framework for object lifetime management. This is the object used primarily
by the Avogadro 2 application; it is passed to the file formats, which take the molecule to
be populated as an input (although they are all implemented in terms of the Core::Molecule
class).

The third is QtGui::RWMolecule, which inherits from just QObject but shares a common
API with the other two classes. The QtGui::Molecule and QtGui::RWMolecule classes
offer fast conversion from one to the other, and the "editable" molecule is highly specialized
for molecular editing. It only supports atoms, bonds, and limited metadata, leaving more
advanced objects such as quantum data, cubes, meshes, etc. to the other classes. It builds
and maintains an internal undo/redo stack, removing the burden from the mouse interaction
tools to implement undo/redo operations. It uses the same array classes that offer
copy-on-write functionality, meaning that conversion to QtGui::Molecule copies very little data
unless further changes are made to the editable molecule.
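
The copy-on-write behavior of the array classes can be illustrated with a short sketch. It assumes a reference count provided by std::shared_ptr and a hypothetical class name, CowArray; the array classes in the Avogadro 2 libraries may be structured differently, and this version is not thread-safe.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical copy-on-write array. Copying a CowArray copies only the
// shared pointer, which increments the reference count of the inner data.
template <typename T>
class CowArray {
public:
  CowArray() : m_data(std::make_shared<std::vector<T>>()) {}

  // Const access never copies the underlying memory.
  std::size_t size() const { return m_data->size(); }
  const T& at(std::size_t i) const { return (*m_data)[i]; }
  const T* constData() const { return m_data->data(); }

  // Non-const methods detach (deep copy) first if the data is shared.
  void push_back(const T& value) {
    detach();
    m_data->push_back(value);
  }
  void set(std::size_t i, const T& value) {
    detach();
    (*m_data)[i] = value;
  }

  bool sharesDataWith(const CowArray& other) const {
    return m_data == other.m_data;
  }

private:
  void detach() {
    if (m_data.use_count() > 1)
      m_data = std::make_shared<std::vector<T>>(*m_data);
  }
  std::shared_ptr<std::vector<T>> m_data;
};
```

Copying a molecule built from such arrays costs one pointer copy per property until one side is modified, which is the property the conversion between the molecule classes relies on.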

The molecule classes are also backed by support classes that provide atomic data,
such as atomic radii, default colors, etc. These functions are significantly faster due to their
use of linear arrays indexed by atomic number, further increasing efficiency when working
with large structures.

2.5.3  Periodic  Structures 

Support  for  periodic  systems  was  implemented,  with  options  to  display  and  edit  the  unit 
cell,  as  well  as  perform  numerous  operations  on  the  cell.  These  concepts  were  added  as 
optional properties on molecules, and several of the native file formats include full support
for expressing periodic structure. Figure 14 shows a large structure from a GROMACS
simulation with the unit cell for the periodic boundaries displayed.

2.5.4  Rendering  and  Graphical  Libraries 

The rendering data structures, support classes, and code reside in a rendering library that
depends upon OpenGL and GLEW, but remains independent of the graphical user interface
toolkit employed. This opens up the possibility of deploying the rendering code in a wider
variety of environments, but requires integration with Qt in order to open up windows, handle
user interaction, and perform other common tasks. This is implemented in a Qt OpenGL
library that provides customized OpenGL render windows and related functionality. Finally,
the desktop widgets necessary for user interaction without OpenGL are in a Qt GUI library, to
enable reuse in non-OpenGL applications. Plugin location, loading, and lifetime management
are also implemented in the Qt GUI library.

Significant advances in the rendering model have focused on leveraging OpenGL 2.1 in
order to maximize performance with good hardware support. Major advantages include
vertex buffer objects, vertex and fragment shader programs, and optimized memory layouts
for rendering. One of the largest bottlenecks in the rendering pipeline for chemical structures
is sphere rendering, which has been significantly mitigated through the use of impostor
sphere rendering. This approach uses a point sprite or single quad to represent a sphere; the
vertex program applies a billboarding transform to ensure the quad faces the camera and
has the correct dimensions in eye space. The fragment shader then applies lighting equations
and an implicit function to update the depth buffer with the correct values to interact with
standard rendering techniques. This leads to a highly optimized rendering scene where only
four points per sphere need to be transformed, and sphere ray-tracing equations can be
applied on a per-pixel basis, offering not only much improved rendering speed but also
pixel-perfect sphere boundaries, due to the use of an implicit sphere rather than the
approximate triangulation traditionally used. The results can be seen in Figure 7.


Figure  7:  Van  der  Waals  rendering  using  impostor  sphere  rendering. 
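
The per-fragment work of an impostor sphere can be sketched in plain C++ (rather than shader code, so that it compiles on its own). The sketch assumes an orthographic projection, where the implicit sphere reduces to a square root on the quad-local coordinates; a perspective projection would need a full ray-sphere intersection. The names ImpostorHit, shadeImpostor, and lambert are hypothetical and are not taken from the Avogadro 2 shaders.

```cpp
#include <cassert>
#include <cmath>

// Result of evaluating one fragment of the billboarded quad.
struct ImpostorHit {
  bool hit;        // false means the fragment would be discarded
  float normal[3]; // unit surface normal in eye space
  float eyeZ;      // eye-space z of the surface point (feeds the depth write)
};

// (u, v) are quad-local coordinates in [-1, 1]; centerZ is the eye-space z of
// the sphere center. The camera is assumed to look down the negative z axis.
inline ImpostorHit shadeImpostor(float u, float v, float centerZ, float radius) {
  ImpostorHit result = {false, {0.0f, 0.0f, 0.0f}, 0.0f};
  const float r2 = u * u + v * v;
  if (r2 > 1.0f)
    return result; // outside the sphere silhouette: discard
  const float w = std::sqrt(1.0f - r2); // implicit sphere: u^2 + v^2 + w^2 = 1
  result.hit = true;
  result.normal[0] = u;
  result.normal[1] = v;
  result.normal[2] = w;
  result.eyeZ = centerZ + radius * w; // surface bulges toward the viewer
  return result;
}

// Simple diffuse (Lambert) term for a unit light direction in eye space.
inline float lambert(const ImpostorHit& h, const float lightDir[3]) {
  const float d = h.normal[0] * lightDir[0] + h.normal[1] * lightDir[1] +
                  h.normal[2] * lightDir[2];
  return d > 0.0f ? d : 0.0f;
}
```

Because the silhouette test and the normal come from the implicit equation, the sphere edge is exact at every pixel regardless of zoom level, while only the four quad corners pass through the vertex stage.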

Ideally a similar approach would be used for cylinders, but the gains are lower (cylinders
are more complex to model in this way and require fewer triangles than spheres). Figure 8
shows a typical ball and stick representation, and the seamless joins achieved between the
ray-traced spheres and the triangulated cylinders. Through the use of optimized data structures
and vastly improved rendering techniques, structures containing hundreds of thousands
of atoms can now be rendered interactively on commodity laptops, whereas in Avogadro 1.x
structures with thousands of atoms would already show performance degradation.

Figure 8: Ball and stick rendering using impostor spheres and triangulated cylinders,
saved directly to PNG from the application.

The  improvements  in  low  level  rendering  capabilities  using  OpenGL  2.1,  and  modern 
shader  language  approaches  are  further  augmented  through  the  use  of  a  custom  scene  graph 
implementation. The scene graph is becoming increasingly accepted as the best pattern
to take maximum advantage of modern graphics cards, and is commonly used in many
rendering engines, from the latest blockbuster video games, through rendering engines for
movies and cartoons, to real-time ray tracing engines. This adds a significant amount of
API and infrastructure to the rendering code in Avogadro 2, but pays off with a simple
user-facing API coupled with extremely efficient batching of rendering.

In  a  typical  scene  any  given  view  might  use  spheres,  cylinders,  triangle  meshes,  and 
text.  Using  the  traditional  immediate  mode  techniques  these  graphical  primitives  would 
be  rendered  in  the  order  they  were  encountered,  but  with  a  scene  graph  multiple  passes 
are used. The first pass involves going from the molecular model to the graphical model,
which is where view plugins transform atoms, bonds, surfaces, etc. into spheres, cylinders,
triangle meshes, etc. This can now be pushed off into background threads, as they are simply
building in-memory representations of the molecule ready for the rendering pass. On the first
rendering pass the scene graph drawable items translate the graphical primitives into vertex
buffer objects (VBOs) that are uploaded to fast GPU memory, along with the appropriate
uniforms and shader programs necessary to render.

Once  the  geometry  has  been  uploaded  to  GPU-resident  memory  the  actual  render  calls 
act  in  large  batched  operations — draw  all  spheres,  draw  all  cylinders,  draw  all  meshes,  etc. 
These  calls  apply  the  shader  code  to  the  geometry  in  memory,  and  when  the  camera  is 
changed  to  look  at  the  structure  from  a  different  angle  or  zoom  level  the  camera  matrix  is 
updated  but  all  other  state  remains  unaffected,  meaning  that  virtually  no  data  needs  to  be 
uploaded  to  the  graphics  card  for  the  next  frame.  This  coupled  with  the  orders  of  magnitude 
decrease  in  vertex  counts  for  spheres  has  enabled  rendering  to  go  from  thousands  of  spheres 
in Avogadro 1.x to millions of spheres in Avogadro 2. The significantly improved core
molecule data structure (with all 3D coordinates being in contiguous memory, for example)
also enables significantly improved initial render times.
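
The upload-once, draw-in-batches pattern described above can be sketched without any OpenGL calls. In this illustration a vector copy stands in for a VBO upload and counters stand in for GPU work; SphereBatch, SketchSphere, and SketchCamera are hypothetical names and do not correspond to classes in the Avogadro 2 rendering library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct SketchSphere { float x, y, z, radius; };
struct SketchCamera { float zoom; float angle; }; // stand-in for a camera matrix

// Collects every sphere in the scene into one drawable.
class SphereBatch {
public:
  void addSphere(float x, float y, float z, float radius) {
    SketchSphere s = {x, y, z, radius};
    m_spheres.push_back(s);
    m_dirty = true; // geometry changed, so the next render must re-upload
  }

  // Called once per frame.
  void render(const SketchCamera& camera) {
    if (m_dirty) {
      m_gpuCopy = m_spheres; // stands in for uploading a VBO
      ++m_uploadCount;
      m_dirty = false;
    }
    m_cameraUniform = camera; // the only per-frame state that changes
    ++m_drawCallCount;        // one batched draw call for all spheres
  }

  std::size_t uploadCount() const { return m_uploadCount; }
  std::size_t drawCallCount() const { return m_drawCallCount; }

private:
  std::vector<SketchSphere> m_spheres;
  std::vector<SketchSphere> m_gpuCopy;
  SketchCamera m_cameraUniform = {1.0f, 0.0f};
  bool m_dirty = false;
  std::size_t m_uploadCount = 0;
  std::size_t m_drawCallCount = 0;
};
```

Rotating or zooming changes only the camera state, so the geometry is uploaded once and each subsequent frame costs one draw call per primitive type rather than one per atom.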

2.5.5  Client-Server  and  Interprocess  Communication 

In addition to lower level changes to data structures, rendering and plugins, a scalable,
client-server oriented architecture has been developed, yielding fast serialization/deserialization of
data and simple migration of objects from one process to another (whether local or remote).
This architecture is exposed as a generic library that leverages Google's protobuf project
for fast, binary communication, with specialized helper classes in Avogadro 2 libraries to
make transfers of common data structures simpler. There is a runtime loaded plugin that
exposes this in the application, and facilities for remote file browsing with all communication
happening over a standard TCP/IP connection. The first screen capture of this is shown in
Figure 9.




Figure  9:  The  Avogadro  2  application  displaying  a  molecule  loaded  on  a  remote  system 
using  the  client-server  functionality. 


Integration with MoleQueue and MongoChem offers seamless communication of molecular
data between components, opening selected results from MongoChem in Avogadro 2,
or new calculations performed by MoleQueue automatically in Avogadro 2 once they are
finished. The original code was developed in the MoleQueue project (described later), and
generalized so that all three applications could easily set up local socket connections and listen
for JSON-RPC 2.0 calls. The code implementing the calls is simple, and can be extended
with minimal effort to support new interactions as desired.
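
As an illustration of what such a call looks like on the wire, the sketch below builds a JSON-RPC 2.0 request object as a string. The "jsonrpc", "id", "method", and "params" members come from the JSON-RPC 2.0 specification; the method name "openFile" and parameter name "fileName" used in the usage note are assumptions made for the example, and the escaping is deliberately minimal.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Minimal JSON string escaping; enough for file paths in this sketch.
inline std::string escapeJson(const std::string& in) {
  std::string out;
  for (char c : in) {
    switch (c) {
      case '"':  out += "\\\""; break;
      case '\\': out += "\\\\"; break;
      case '\n': out += "\\n"; break;
      case '\t': out += "\\t"; break;
      default:
        if (static_cast<unsigned char>(c) < 0x20) {
          char buf[8];
          std::snprintf(buf, sizeof(buf), "\\u%04x", c);
          out += buf;
        } else {
          out += c;
        }
    }
  }
  return out;
}

// Builds a JSON-RPC 2.0 request with a single named string parameter.
inline std::string makeRequest(int id, const std::string& method,
                               const std::string& paramName,
                               const std::string& paramValue) {
  std::string out = "{\"jsonrpc\":\"2.0\",\"id\":";
  out += std::to_string(id);
  out += ",\"method\":\"" + escapeJson(method) + "\"";
  out += ",\"params\":{\"" + escapeJson(paramName) + "\":\"" +
         escapeJson(paramValue) + "\"}}";
  return out;
}
```

For example, makeRequest(1, "openFile", "fileName", "/tmp/water.cml") produces a single request object that a listening application could match against its registered methods and answer with a response carrying the same id.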

2.5.6  Input  Generators  in  Separate  Processes 

In addition to the previous approach of adding new input generators as plugins, a new
methodology has been developed. Instead of writing a plugin
in C++ or Python using the wrapped API of Avogadro, calling a script directly from the
Avogadro application in a separate process is now possible.

This  suffers  from  a  higher  startup  cost,  but  benefits  greatly  from  the  level  of  simplicity 
in  designing,  adding  and  editing  input  generators.  All  molecular  geometry  and  calculation 
settings  can  be  passed  in  using  a  JSON  input  to  the  Python  (or  any  other  language)  process, 
and  the  generated  output  can  be  passed  back  to  the  application  using  the  standard  output  of 
the  process.  Most  languages  have  support  for  JSON,  and  can  parse  it  very  efficiently.  Calling 
an  independent  process  removes  any  issues  around  consistent  linking  of  the  plugin,  licensing 
issues  and  complexity  of  learning  a  new  API.  This  approach  can  be  used  in  combination 
with the previous approach, with some input file generators being C++ plugins, and others
being implemented in these independent scripts. Should the plugin crash or be unreliable
it will not crash the main application process, and the user can be informed of the issue.
Examples of C++ and Python input generators are shown in Figure 10; both feature syntax
highlighting.

Figure 10: The GAMESS input generator (left) implemented in C++, and NWChem
(right) implemented in Python.


This  also  opens  up  the  possibility  of  much  simpler  packaging,  sharing  and  independent 
releases  of  input  generator  plugins.  They  can  be  downloaded  and  placed  in  an  appropriate 
location.  Avogadro  2  takes  care  of  calling  them  using  several  entry  points  asking  for  the  menu 
entry  to  be  added  to  the  application,  supported  options  and  desired  molecular  geometry 
format.  The  plugins  can  also  return  syntax  highlighting  rules,  which  are  then  loaded  by  the 
input generator and shown in real time when the user hand edits the input file. Concerns
over how well this might scale to larger structures have also been mitigated by providing
a geometry specification syntax, with which the plugin can specify how the molecular geometry
should be passed, along with a known keyword for later replacement. The framework also
supports  the  generation  of  multiple  input  files. 
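
The keyword replacement mentioned above can be sketched from the application's side: the script returns input text containing a placeholder, and the application substitutes the formatted coordinates afterwards, so large geometries never have to pass through the script. The placeholder text, the coordinate layout, and the names SketchAtomRecord, formatCoordinates, and insertGeometry are all hypothetical; the actual syntax used by Avogadro 2 is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

struct SketchAtomRecord {
  std::string symbol;
  double x, y, z;
};

// Formats one "symbol x y z" line per atom (layout chosen for this sketch).
inline std::string formatCoordinates(const std::vector<SketchAtomRecord>& atoms) {
  std::string out;
  char line[128];
  for (const SketchAtomRecord& a : atoms) {
    std::snprintf(line, sizeof(line), "%-2s %10.5f %10.5f %10.5f\n",
                  a.symbol.c_str(), a.x, a.y, a.z);
    out += line;
  }
  return out;
}

// Replaces every occurrence of the placeholder keyword with the geometry.
inline std::string insertGeometry(std::string text, const std::string& keyword,
                                  const std::string& geometry) {
  if (keyword.empty())
    return text; // nothing to replace; also avoids an endless loop
  std::size_t pos = 0;
  while ((pos = text.find(keyword, pos)) != std::string::npos) {
    text.replace(pos, keyword.size(), geometry);
    pos += geometry.size(); // continue searching after the inserted block
  }
  return text;
}
```

With this division of labor the script only has to emit a short template, and the cost of writing millions of coordinates stays inside the compiled application.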

2.5.7  File  Format  Extensions  in  a  Separate  Process 

Early  on  in  Avogadro  2’s  development  there  was  a  strong  desire  to  maintain  support  for 
the  large  number  of  file  formats  supported  by  Open  Babel  without  linking  to  the  Open 
Babel  library  due  to  licensing  and  program  stability  concerns.  This  led  to  the  design  of 
a  meta-format  plugin  that  queried  the  obabel  command  line  executable  for  all  supported 
formats, added them to the Avogadro 2 formats, and integrated them into the application's
I/O routines. This was achieved, and found to work well, using obabel to convert to CML,
XYZ, or MDL as appropriate, making a call in a separate process to obabel for each file to
be opened or saved.

This  approach  was  then  extended,  following  a  similar  pattern  to  the  input  generators 
in a separate process described above. There is a set of known entry points that must be
implemented,  informing  the  application  if  the  format  can  read,  write,  or  both,  along  with  the 
requested  format  for  writing  or  the  format  the  application  should  expect  for  reading.  The 
files  are  then  passed  to  the  plugin  as  requested,  which  is  expected  to  perform  the  translation 
to/from  the  format  it  implements.  This  again  enables  the  simple  extension  of  the  application 
using  Python  scripts  (or  any  other  language),  and  new  formats  can  be  seamlessly  added  to  the 
application. From the end user's perspective these plugins are indistinguishable from native
file formats and/or input generators, though it should be noted that they suffer from the inherent
overhead of starting a distinct process for each call. They use the standard input/output
streams to avoid complications with temporary files.

2.5.8  OpenQube — Moved  into  Avogadro  Libraries 

The OpenQube project began as part of an Avogadro plugin, and was later split out in order
to make it more useful in other applications. It originally contained a minimal molecule data
structure, which has since been ported to reuse the structure implemented in Avogadro::Core
and moved to a library in the Avogadro 2 libraries repository, Avogadro::Quantum. Support
for a variety of output file formats, such as GAMESS, GAMESS-UK, MOPAC, and
Molden, has been added. Experimental support for CMLComp is also being developed,
with a collaboration between NWChem, Open Chemistry, and community members exploring
augmenting NWChem and Avogadro 2 with CMLComp support, and developing new
converters to go from log file formats to the CMLComp format. The CMLComp convention
is being developed in a larger collaboration, with XML dictionaries being developed to
extend CML for use in computational chemistry.

Further generalization of OpenQube also took place, in order to develop a more widely
applicable and efficient data structure where basis sets can be shared between multiple
atoms, with support for UHF as well as RHF and higher order Gaussian type orbital functions.
The OpenQube code had a hard dependency on Qt for the QtConcurrent parallelization
framework; this was removed, with a simple serial implementation in the core library and
the parallel QtConcurrent based approach moved to an Avogadro 2 plugin. This adds the
possibility of client-server molecular orbital and electron density calculations, which are well
suited to this approach due to the highly CPU bound nature of the calculations.
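
To show why these calculations are CPU bound and straightforward to distribute, the sketch below fills a cube of grid points with a single normalized s-type Gaussian primitive using a plain serial loop. It is a simplified stand-in under stated assumptions: one primitive centered at the origin, hypothetical function names, and none of the basis set bookkeeping a real implementation requires.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Normalized s-type Gaussian primitive centered at the origin:
//   g(r) = (2*alpha/pi)^(3/4) * exp(-alpha * r^2)
inline double sTypeGaussian(double alpha, double x, double y, double z) {
  const double pi = 3.14159265358979323846;
  const double norm = std::pow(2.0 * alpha / pi, 0.75);
  return norm * std::exp(-alpha * (x * x + y * y + z * z));
}

// Serial fill of an n*n*n cube spanning [-extent, extent] on each axis.
// Every grid point is independent, so the loop body could be handed to a
// thread pool or a remote process without changing the arithmetic.
inline std::vector<double> fillCube(double alpha, std::size_t n, double extent) {
  std::vector<double> cube(n * n * n);
  const double step = n > 1 ? 2.0 * extent / static_cast<double>(n - 1) : 0.0;
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      for (std::size_t k = 0; k < n; ++k) {
        const double x = -extent + static_cast<double>(i) * step;
        const double y = -extent + static_cast<double>(j) * step;
        const double z = -extent + static_cast<double>(k) * step;
        cube[(i * n + j) * n + k] = sTypeGaussian(alpha, x, y, z);
      }
    }
  }
  return cube;
}
```

A molecular orbital sums many such terms at every grid point, so the work grows with the number of grid points times the number of basis functions while the input (geometry and coefficients) stays small.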

2.5.9  Mouse  Interaction  Tools 

Due  to  the  creation  of  multiple  molecule  classes  the  complexity  of  the  tools  plugins  increased 
a  little,  but  this  was  offset  by  the  centralization  of  the  undo/redo  management.  Several  tools 
were  ported  from  Avogadro  1.x,  although  these  largely  became  rewrites  due  to  significant 
changes  to  the  core  APIs,  and  simplifications  that  became  more  obvious  when  reexamining 
previous  approaches. 


Figure 11: The bond-centric tool, rewritten but retaining most functionality from
Avogadro 1.x.


Figure 11 shows the bond-centric tool. The edit tool features an improved periodic
table widget that can be resized dynamically. Edit widgets are only enabled if the editable
molecule has been used; the general tools are designed to work with both structures. Support
for playing molecular trajectories was also added as a tool in order to enable fuller interaction,
and support for large trajectory files has been demonstrated.

2.5.10  Scene  Plugins


All  3D  views  of  the  molecular  structure  are  transformed  from  input  data  to  objects  that  can 
be  rendered  in  the  scene  graph  using  scene  plugins.  These  classes  have  a  number  of  entry 
points,  but  the  main  ones  are  the  process  methods  that  take  the  molecule  to  be  rendered, 
and  a  node  in  the  scene.  They  populate  the  node  with  graphical  primitives  to  be  rendered, 
as  shown  in  the  listing  below  taken  from  the  ball  and  stick  plugin. 

void BallAndStick::process(const Molecule &molecule,
                           Rendering::GroupNode &node)
{
  GeometryNode *geometry = new GeometryNode;
  node.addChild(geometry);

  // Add a sphere geometry drawable to contain all of the spheres.
  SphereGeometry *spheres = new SphereGeometry;
  spheres->identifier().molecule = &molecule;
  spheres->identifier().type = Rendering::AtomType;
  geometry->addDrawable(spheres);

  for (Index i = 0; i < molecule.atomCount(); ++i) {
    Core::Atom atom = molecule.atom(i);
    unsigned char atomicNumber = atom.atomicNumber();
    if (atomicNumber == 1 && !m_showHydrogens)
      continue;
    const unsigned char *c = Elements::color(atomicNumber);
    Vector3ub color(c[0], c[1], c[2]);
    spheres->addSphere(atom.position3d().cast<float>(), color,
                       static_cast<float>(Elements::radiusVDW(atomicNumber))
                       * 0.3f);
  }

  float bondRadius = 0.1f;

  // Add a cylinder geometry drawable to contain all of the cylinders.
  CylinderGeometry *cylinders = new CylinderGeometry;
  cylinders->identifier().molecule = &molecule;
  cylinders->identifier().type = Rendering::BondType;
  geometry->addDrawable(cylinders);

  for (Index i = 0; i < molecule.bondCount(); ++i) {
    Core::Bond bond = molecule.bond(i);
    if (!m_showHydrogens
        && (bond.atom1().atomicNumber() == 1
            || bond.atom2().atomicNumber() == 1)) {
      continue;
    }
    Vector3f pos1 = bond.atom1().position3d().cast<float>();
    Vector3f pos2 = bond.atom2().position3d().cast<float>();
    Vector3ub color1(Elements::color(bond.atom1().atomicNumber()));
    Vector3ub color2(Elements::color(bond.atom2().atomicNumber()));
    Vector3f bondVector = pos2 - pos1;
    float bondLength = bondVector.norm();
    bondVector /= bondLength;

    switch (m_multiBonds ? bond.order() : 1) {
    case 3: {
      Vector3f delta = bondVector.unitOrthogonal() * (2.0f * bondRadius);
      cylinders->addCylinder(pos1 + delta, bondVector, bondLength, bondRadius,
                             color1, color2, i);
      cylinders->addCylinder(pos1 - delta, bondVector, bondLength, bondRadius,
                             color1, color2, i);
    }
    // No break: a triple bond falls through to add the central cylinder.
    default:
    case 1:
      cylinders->addCylinder(pos1, bondVector, bondLength, bondRadius,
                             color1, color2, i);
      break;
    case 2: {
      Vector3f delta = bondVector.unitOrthogonal() * bondRadius;
      cylinders->addCylinder(pos1 + delta, bondVector, bondLength, bondRadius,
                             color1, color2, i);
      cylinders->addCylinder(pos1 - delta, bondVector, bondLength, bondRadius,
                             color1, color2, i);
    }
    }
  }
}

This  listing  demonstrates  how  simple  the  scene  API  remains,  enabling  developers  of  scene 
plugins  to  focus  on  the  geometry  rather  than  worrying  about  OpenGL,  global  state,  batching 
of draw calls, selection, etc. Settings, such as whether hydrogen atoms are shown, are
configured from the plugin's configuration dialog (shown in the application). The full
source  of  the  display  plugins  shows  more  of  the  detail,  but  the  above  should  be  enough  to 
demonstrate  how  simple  it  is  to  add  new  visualizations  to  Avogadro  2.  This  approach  has 
been  used  to  process  millions  of  atoms,  add  them  to  the  scene  and  render  them  interactively 
on  consumer-grade  laptops  running  Linux,  Mac  OS  X  and  Windows. 


2.5.11  Extension  Plugins 

Extensions that are primarily commands executed from the application menus, or additional
file formats, utility functionality, etc., were implemented in the extension plugin framework.
These extensions can add entries to the menu, which are dynamically created when the
application is loaded. They may open dialogs, such as the input generator plugins, calculate
new derived quantities, such as the quantum plugins, or add support for new file types, such
as the Open Babel plugin and Python I/O plugin. Operations on the molecule model, such as
bonding, hydrogen addition/removal, and geometry optimization are also possible. Integration
with online databases, downloading structures by name and finding similar molecules
to the currently shown molecule were all implemented as simple extension plugins that are
loaded dynamically. These could easily be extended, or modified to use different data sources,
or extend the queries made.

The QTAIM plugin is perhaps the one that contains the largest amount of code at this
stage. It was ported from the code contributed to Avogadro 1.x, and could probably be
significantly optimized. The basics of the code work, and it is now more fully integrated than
was ever before possible. There are also some small pieces of mathematical code, limited
to the plugin, that are GPL-licensed; at this stage this causes the plugin to be GPL-licensed
too, due to the copyleft licensing terms. An example rendering is shown in Figure 12; this
component was used in a recent course at University College London (UCL).

Figure 12: The Avogadro 2 rendering of a wave function file showing QTAIM results.

2.5.12  VTK  Integration 

The Visualization Toolkit (VTK) offers significant visualization and analysis capabilities.
One of the primary motivating factors for supporting multiple view widget types was to
add a VTK widget to the main application. This has made it possible to offer side-by-side
comparisons of electronic structure using simple isosurfaces and volume rendered geometry.
In order to offer seamless integration a vtkAvogadroActor class was developed, taking the
scene graph based rendering used in the main Avogadro 2 views and placing it in the
rendered 3D view alongside native VTK visualization. This gives a visually appealing solution
where the ball and stick rendering is completely identical, with a volume rendering
overlaid, and the same scene plugins can be reused with no code modifications. Figure 13
shows this capability in the two widgets rendered on the left of the application.

2.6  Avogadro  2  Application 

The Avogadro 2 application has also changed substantially, largely due to the significant
changes in the Avogadro 2 libraries. The application remains focused on providing
an end-user application, ready for use by non-programmers. It makes use of the file format
framework, and moves all file loading/saving to background threads. The program supports
several command line options, and now offers an RPC server that responds to JSON-RPC
2.0 calls, listening on a local named socket.

The  application  has  been  ported  to  use  the  latest  generation  of  the  Qt  libraries,  with 
significant changes to the core model to support multiple molecules in a single window, as
well  as  multiple  view  widgets/widget  types,  as  shown  in  Figure  13.  These  changes  make  it 
possible  to  dynamically  add  a  range  of  display  types,  seamlessly  mixing  native  Avogadro  2 
display  widgets  with  VTK  widgets.  In  the  future  this  framework  could  easily  support  yet 
more  widget  types,  including  web-based  views  seamlessly  integrating  dynamic  web  content 
with  the  natively  rendered  views. 


Figure  13:  The  Avogadro  2  application  shown  displaying  different  rendering  styles  of 
different  molecule  views. 

The application is in a separate repository, ensuring complete independence from the
libraries, and acting as a living example of how a custom application can be developed based on
the Avogadro 2 libraries. This acts as one of the most complete custom applications, offering
the most options along with automated packaging/integration capabilities. The application
features integration with MoleQueue, MongoChem and some online chemical databases. The
editable molecule widget uses the editable molecule that features native undo/redo support.
This application serves as the full-featured demonstration of capabilities in the libraries
developed. It can be installed alongside Avogadro 1.x, and while its capabilities exceed those
of Avogadro 1.x in many areas, there are still features that have yet to be ported. It has a
similar yet distinct icon, to highlight both the heritage of the project and the distinct differences
present in this rewrite; Avogadro 2 was developed as part of the Phase II SBIR project.

The  application  has  been  demonstrated  editing  new  structures,  minimizing  the  structure, 
generating  input  for  a  number  of  codes,  submitting  them  using  MoleQueue,  and  visualizing 
the electronic structure of the output. It has also been shown loading structures in excess
of 2.8 million atoms on a standard laptop, rendering them interactively, and even showing
multiple large and small structures side-by-side. The large structure is shown in Figure 14.
Trajectories  from  molecular  dynamics  calculations  can  be  loaded,  and  animated  in  the  main 
interface.  Support  for  periodic  structures  is  also  present,  along  with  client-server  capabilities 
that  enable  interaction  with  large  data  on  remote  systems  using  a  server-resident  process. 


Figure 14: The Avogadro 2 application displaying a GROMACS structure, kindly shared
by Peter Tieleman, with over 2.8 million atoms.


This application represents a significant step forward in capabilities, positioned to make
a significant impact on the field. It is easily extensible with modest levels of expertise, and
can handle very large and very small structures equally well. The layout of the libraries,
plugin interfaces and licensing make it especially amenable to use in various DoD projects
where custom functionality can be implemented for specialized application areas.

2.7  MongoChem

The MongoChem application is developed using C++, Qt, MongoDB, VTK, and the
Avogadro 2 libraries, with some of its cheminformatics functionality coming from Open Babel and
Chemkit. This application is focused on facilitating the deposition of chemical data in a
scalable  database,  adding  various  properties  to  the  molecules,  and  facilitating  the  dynamic 
visualization  and  analysis  of  aggregate  data.  It  has  been  generalized  to  support  connections 
to  different  MongoDB  database  servers,  including  the  use  of  shared  servers,  in  the  cases 
where  very  large  databases  are  required. 


Figure  15:  The  scatter  plot  matrix  view  showing  multiple  . 


The range of charts available has been extended to include simple histograms and scatter
plots, through to multivariate visualization techniques such as parallel coordinates and
scatter plot matrices (which combine scatter plots for multiple dimensions, along with population
histograms for each variable and linked selections, shown in Figure 15). The charts
make  use  of  VTK’s  selection  linking  functionality  that  enables  users  to  make  and  visualize 
subsets  of  the  data  in  many  different  views  and  representations,  as  shown  in  Figure  16. 

The use of molecule fingerprinting techniques gives the database the ability to be searched
by similarity to a desired structure, as well as enabling queries on chemical name, tag,
and other properties. These results can then be viewed in the different data display views
available in order to further inspect the selected subset of data or do further calculations/export.
This enables integration of independently developed software, such as that used to generate
QSAR data, into the framework while enabling scientists to make use of the analytical
capabilities of the application.

Figure 16: Linked selection in several charts and the table view in MongoChem.

Many of the charts in the application feature intelligent tooltips that display the 2D
structure along with the IUPAC name, as shown in Figure 17. Network connectivity views
based both on fingerprint and structural similarity enable users to view the overall
relatedness of compounds in the database. The K-means clustering view displays descriptor
values using 3D charts and groups them based on similarity (see Figure 18); the view features
interaction (panning, zooming, etc.), and the clustering parameters can be modified.
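
The grouping step behind such a view can be illustrated with a minimal sketch of k-means
(Lloyd's algorithm) in plain Python. This is not MongoChem's implementation; the function
name `kmeans`, the descriptor values, and the choice of initial centers are all assumptions
made for illustration.

```python
import math

def kmeans(points, centers, iterations=20):
    """Cluster descriptor vectors with Lloyd's algorithm.

    points  -- list of equal-length tuples of descriptor values
    centers -- initial cluster centers (hypothetical starting guesses)
    Returns (labels, centers).
    """
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centers.append(mean)
            else:
                new_centers.append(old)  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return labels, centers

# Made-up three-descriptor values for six molecules.
descriptors = [
    (0.0, 0.1, 0.0), (0.2, 0.0, 0.1), (0.1, 0.2, 0.1),
    (5.0, 5.1, 4.9), (5.2, 4.9, 5.0), (4.9, 5.0, 5.1),
]
labels, centers = kmeans(descriptors, centers=[descriptors[0], descriptors[3]])
```

The number of clusters is set by how many initial centers are passed in, which corresponds
to the kind of clustering parameter a user might adjust in the view.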

Large  data  sets  can  be  imported  using  simple  Python  scripts,  or  with  graphical  tools  such 
as  the  CSV  importer  in  the  application.  In  addition  to  the  desktop  functionality,  a  prototype 
web  application  has  been  developed  which  shows  the  data  from  the  same  MongoDB  store 
using  modern  web  techniques  coupled  with  the  Python-based  server-side  frameworks  and 
VTKWeb  framework  to  give  any  user  read-only  access  to  the  data,  and  the  ability  to  visualize 
basic  2D  and  3D  structure  (with  interaction  for  the  3D  visualization). 
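
A simple import script of the kind mentioned above might start from the following sketch,
which turns CSV text into one dictionary per molecule. The column names (`name`, `inchi`,
`mass`) and the document layout are hypothetical rather than MongoChem's actual schema.

```python
import csv
import io

def rows_to_documents(csv_text, numeric_fields=("mass",)):
    """Convert CSV text into a list of dictionaries, one per molecule."""
    documents = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Strip stray whitespace; missing cells become empty strings.
        doc = {key: (value or "").strip() for key, value in row.items()}
        for field in numeric_fields:
            if doc.get(field):
                doc[field] = float(doc[field])
        documents.append(doc)
    return documents

# Hypothetical input with assumed column names.
sample = (
    "name,inchi,mass\n"
    "water,InChI=1S/H2O/h1H2,18.015\n"
    "methane,InChI=1S/CH4/h1H4,16.043\n"
)
documents = rows_to_documents(sample)
```

With a running server, a list like this could be handed to a MongoDB driver, for example
PyMongo's `insert_many` on a collection of your choosing.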

In addition to the charts and table view, a detailed dialog is available once a single
molecule has been selected, as shown in Figure 19; this gives further details on the
structure, such as InChI and SMILES strings. The detailed table views enable structures to
be exported or opened directly in Avogadro. Structures can have multiple tags that are
searchable, and annotations can be saved with notes relevant to the structure.

A collaboration with the Aspuru-Guzik group at Harvard University has also made available
a large number of electronic structure calculations. The data set is about 0.5 PB in size,
and an initial 4 TB sample has been duplicated for testing and performance benchmarking.
The data set is interesting for the MongoChem application as it includes a large number of
small molecule structures that are candidates for organic solar cell materials, with multiple
calculations per structure using different levels of theory and calculation types. A modified
version of Q-Chem was used to perform the calculations on the World Community Grid.

Figure 17: Custom tooltips in scatter plots displaying the 2D structure and IUPAC name
of the point under the cursor.

Figure 18: K-means clustering view in MongoChem.

Figure 19: The detailed dialog in MongoChem for a molecular structure.

There is a live demo available at data.openchemistry.org that contains a snapshot of
the Clean Energy Project’s data, showing a customized web view, searchable database,
and capability to both display and download 3D structures. The client code is largely
JavaScript/HTML5, using simple RESTful APIs to retrieve data; these same APIs could
be used by desktop applications. The data available on the site can also be displayed and
edited in the desktop MongoChem application.

Figure 20 shows a capture of the card view offered by the site, summarizing some of
the calculation details. Clicking on any card opens up a detailed view, which also has an
interactive 3D view of the structure. The web demo uses an in-memory fingerprint database
in order to accelerate similarity and substructure searches, which remain among the most
important queries yet are poorly supported by the underlying MongoDB database server
technology.
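
The idea of an in-memory fingerprint search can be sketched as follows. The fingerprints
here are tiny made-up bit sets, and the Tanimoto coefficient is used as one common
similarity measure; the report does not state which fingerprint type or measure the demo
uses, so both are assumptions.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints stored as Python ints (bit sets)."""
    union = bin(a | b).count("1")
    if union == 0:
        return 1.0  # two empty fingerprints are treated as identical
    return bin(a & b).count("1") / union

def similarity_search(query, fingerprints, threshold=0.5):
    """Return (name, score) pairs at or above the threshold, best first."""
    hits = [(name, tanimoto(query, fp)) for name, fp in fingerprints.items()]
    hits = [hit for hit in hits if hit[1] >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Hypothetical 8-bit fingerprints keyed by molecule name, held entirely in memory.
index = {
    "mol_a": 0b10110010,
    "mol_b": 0b10110000,
    "mol_c": 0b01001101,
}
results = similarity_search(0b10110010, index, threshold=0.5)
```

Because every comparison is a pair of bitwise operations on data already in memory, a
linear scan like this avoids a round trip to the database for each candidate.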

During the course of development it became clear that the MongoDB C++ client libraries
do not offer a stable API, and that the MongoDB facilities need to be augmented with
more capabilities on the server side, such as the in-memory structure search capabilities.
A more complete solution would implement simple authentication, access control, and
acceleration capabilities on the server side with a thin interface through to MongoDB. This
would make the offering more flexible and extensible, but would require more investment in
order to fully realize. The solution developed remains viable, even for collections of
millions of molecules, but has some bottlenecks in search capabilities and could only be
deployed on trusted networks at this stage (optionally with an Internet-facing web
component, such as that prototyped).

Figure 20: The MongoChemWeb live demo showing data from the Clean Energy Project.

The application features integration with MoleQueue, where subsets of the data can use the
input generator framework from Avogadro 2 and submit computational jobs. It also features
integration with Avogadro 2, offering the capability to open structures in the running
Avogadro 2 instance, and to show similar structures in its database when initiated from
Avogadro 2. These local RPC calls leverage the framework developed in MoleQueue, and more
calls can easily be added in future. The application makes use of Avogadro 2 rendering
widgets, file formats, and input generators, with scope to increase the level of interaction
in future versions.

2.8  MoleQueue 

The MoleQueue application is a C++ Qt application developed to provide an abstraction to
local and remote computational resources. Its functionality is not chemistry-specific, but
it is necessary in order to remove many of the barriers encountered by users attempting to
make use of computational chemistry applications on both the desktop and remote
computational resources.

The project provides two components: a system-tray resident application where remote
and local computational resources are configured to act as a local job dispatch server, and a
client that uses a remote procedure call (RPC) API to stage and submit computational jobs.
The RPC API uses a data format called JSON (JavaScript Object Notation) to carry data
in a language- and architecture-independent fashion. This format was chosen due to the vast
array of implementations in virtually every programming language. The JSON-RPC 2.0
specification builds upon the JSON data format in order to provide a cross-platform,
device-independent RPC API that can easily be implemented in any language desired. Finally,
a local socket transport was chosen for security reasons: on every supported operating
system, local sockets follow the same permission model as files, so only users with access
to the user's files on the local system can access the local socket to submit jobs.

It is possible to add further transports, but the communication protocol will remain very
simple and easy to implement. A C++ Qt client library is provided, along with some
reference implementations for submitting jobs using the Python language. The JSON-RPC 2.0
specification defines a protocol identifier (a string-value pair) along with request,
response, notification, and error messages. Requests consist of a JSON-RPC message that
contains a method key with a corresponding method name and a params object that contains
any method call parameters necessary to complete the request. The client will receive a
reply or an error with an id matching that of the request.
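
A client-side sketch of this message structure, using only Python's standard `json`
module, is given here. The helper names `make_request` and `match_reply` are hypothetical
and are not part of the MoleQueue client library; the transport is omitted entirely, and the
reply is a made-up stand-in for what a server might send.

```python
import itertools
import json

_ids = itertools.count(1)  # source of unique request ids

def make_request(method, params=None):
    """Build a JSON-RPC 2.0 request as a dictionary."""
    request = {"jsonrpc": "2.0", "method": method, "id": next(_ids)}
    if params is not None:
        request["params"] = params
    return request

def match_reply(request, reply_text):
    """Decode a reply, check its id, and return the result or raise on error."""
    reply = json.loads(reply_text)
    if reply.get("id") != request["id"]:
        raise ValueError("reply id does not match request id")
    if "error" in reply:
        raise RuntimeError(reply["error"].get("message", "unknown error"))
    return reply["result"]

request = make_request("listQueues")
wire_text = json.dumps(request)  # what would be written to the transport

# Made-up reply with the same id as the request.
fake_reply = json.dumps(
    {"jsonrpc": "2.0", "result": {"Local": ["GAMESS"]}, "id": request["id"]}
)
queues = match_reply(request, fake_reply)
```

Keeping the id check in one place means a client with several requests in flight can pair
each reply with the call that produced it.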

A simple example, requesting a list of queues and their programs, looks like this:

{
  "jsonrpc": "2.0",
  "method": "listQueues",
  "id": 42
}

The  response  to  this  request  might  look  something  like  this: 

{
  "jsonrpc": "2.0",
  "result": {
    "Diamond": [
      "GAMESS",
      "MOPAC",
      "Gaussian",
      "NWChem"
    ],
    "Garnet": [
      "GAMESS",
      "MOPAC",
      "Gaussian",
      "NWChem"
    ],
    "Local": [
      "GAMESS",
      "MOPAC",
      "Gaussian",
      "NWChem"
    ]
  },
  "id": 42
}

A  slightly  more  complex  RPC  call  to  submit  a  job  using  MoleQueue  would  look  as  follows: 

{
  "jsonrpc": "2.0",
  "method": "submitJob",
  "params": {
    "queue": "Garnet",
    "program": "GAMESS",
    "description": "B3LYP H2O optimization",
    "inputFile": {
      "filename": "job.inp",
      "contents": "Full contents of input file.\nWill be created in the working tree."
    }
  },
  "id": 23
}

This submits a job to the remote queue named “Garnet,” with the program named “GAMESS.”
The description is the string that will show up in the MoleQueue user interface, and the
input file is specified by either a full path or a file name and contents string. The
response for a successful submission looks something like the following:

{
  "jsonrpc": "2.0",
  "result": {
    "moleQueueId": 17,
    "workingDirectory": "/tmp/MoleQueue/17/"
  },
  "id": 23
}

This response object gives a long-lived identifier for the job, “moleQueueId,” along with the
working directory where all files will be staged. If there was an error, then an error
response will be generated (with an “id” matching that of the request) carrying an
appropriate error, such as the queue or program not existing. Once a job is submitted,
notifications are sent when the job state changes; for example, from submitted to running,
error, completed, etc. Each of the notifications carries the moleQueueId, along with the
notification of the job-state change. It is then possible to act upon these changes. There
are also RPC methods to query jobs, which can give access to remote job identifiers, and to
cancel an already submitted job.
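
Client code has to tell these message kinds apart. One way to do so, following the
JSON-RPC 2.0 rule that a notification is a request object without an `id` member, is
sketched here. The notification method name `jobStateChanged` and the parameter names
`oldState` and `newState` are assumptions for illustration and should be checked against
the MoleQueue API documentation.

```python
import json

def classify(message):
    """Classify a decoded JSON-RPC 2.0 message."""
    if "method" in message:
        return "request" if "id" in message else "notification"
    if "error" in message:
        return "error"
    if "result" in message:
        return "response"
    return "invalid"

def track_job_states(wire_messages):
    """Return {moleQueueId: latest state} from a sequence of JSON texts."""
    states = {}
    for text in wire_messages:
        message = json.loads(text)
        # Only act on the (assumed) job-state notification.
        if classify(message) == "notification" and message["method"] == "jobStateChanged":
            params = message["params"]
            states[params["moleQueueId"]] = params["newState"]
    return states

# Made-up message stream: one submission response, then two state changes.
stream = [
    json.dumps({"jsonrpc": "2.0", "result": {"moleQueueId": 17}, "id": 23}),
    json.dumps({"jsonrpc": "2.0", "method": "jobStateChanged",
                "params": {"moleQueueId": 17, "oldState": "Submitted",
                           "newState": "Running"}}),
    json.dumps({"jsonrpc": "2.0", "method": "jobStateChanged",
                "params": {"moleQueueId": 17, "oldState": "Running",
                           "newState": "Completed"}}),
]
latest = track_job_states(stream)
```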

The MoleQueue application runs in the system tray, and provides a graphical interface
to define remote queues and how to execute programs on those remote queues, and acts as an
interface for event logging/current job status on all remote systems to which MoleQueue
has submitted jobs. Figure 21 shows the program configuration dialog; those familiar with
PBS submission scripts should recognize the parameters. A simple keyword substitution is
used to replace keywords specified in the template with job-specific settings. This file will
be constructed upon job submission and uploaded to the remote system. It will then be
submitted to the batch scheduler, typically using the qsub command or its equivalent.
Queue/program settings can be imported and exported, facilitating the easy setup of queues
across sites if system administrators distribute relevant files with their submission
criteria.
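
The keyword substitution step can be sketched in a few lines of Python. The `$$keyword$$`
delimiter and the keyword names used here are assumptions chosen for illustration and may
differ from MoleQueue's actual template syntax.

```python
import re

def fill_template(template, settings):
    """Replace $$keyword$$ markers with job-specific values.

    Unknown keywords are left untouched so that they are easy to spot.
    """
    def substitute(match):
        key = match.group(1)
        return str(settings.get(key, match.group(0)))
    return re.sub(r"\$\$(\w+)\$\$", substitute, template)

# Hypothetical template with assumed keyword names.
template = (
    "#!/bin/sh\n"
    "#PBS -N $$jobName$$\n"
    "#PBS -l nodes=$$nodes$$:ppn=$$coresPerNode$$\n"
    "rungms $$inputFileName$$ > job.out\n"
)
script = fill_template(template, {
    "jobName": "MoleQueueTest",
    "nodes": 1,
    "coresPerNode": 1,
    "inputFileName": "job.inp",
})
```

Leaving unrecognized keywords in place, rather than replacing them with an empty string, is
a design choice in this sketch: a stray marker in the generated script is easier to diagnose
than a silently missing value.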


[Screenshot of the "Configure Program" dialog, "Input Template" section, with Launch Syntax
set to "Custom" and the following template:]

#!/bin/sh
#
# Sample job script provided by MoleQueue.
#
#These commands set up the Grid Environment for your job:
#PBS -N MoleQueueTest
#PBS -l nodes=1:ppn=1
#PBS -q parity
#PBS -M user@host.edu
##PBS -m abe

source /etc/profile.d/modules.sh
module load gamess

export SCRATCH=$TMPDIR/scratch
echo "Using $SCRATCH as GAMESS scratch."
mkdir -p $SCRATCH

cp -v $PBS_O_WORKDIR/job.inp job.inp
rungms job.inp > job.out
cp -v job.out $PBS_O_WORKDIR/job.out

Figure  21:  The  program  configuration  dialog  in  MoleQueue. 

In order to be useful on as many high performance computing resources as possible, it was
necessary to implement several secure transport methods. The first of these is SSH (secure
shell), an industry standard employed by many of the world’s largest supercomputers
to provide access to computational resources. The MoleQueue application can call a specified
SSH command line client, or make use of the libssh2 library. The main reason for providing
support for both is the lack of Kerberos support in libssh2 and the custom patches applied
by some sites to SSH clients in order to support different challenge-response authentication
techniques. Once a transport has been chosen to support authentication, command dispatch,
and file transfer, it is then necessary to support the batch scheduling systems in use on the
HPC resource, primarily Sun Grid Engine, PBS, and SLURM.
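
The separation between transport and batch scheduler can be pictured with the following
sketch, which only builds the command line that would be run over SSH. The class and
attribute names are hypothetical and the host name is made up; `qsub` (PBS and Sun Grid
Engine) and `sbatch` (SLURM) are the standard submission commands of those schedulers.

```python
import shlex
from dataclasses import dataclass

# Submission command used by each batch scheduler.
SUBMIT_COMMANDS = {"PBS": "qsub", "SGE": "qsub", "SLURM": "sbatch"}

@dataclass
class RemoteQueue:
    """A remote queue reached over SSH (hypothetical abstraction)."""
    host: str
    user: str
    scheduler: str
    ssh_executable: str = "ssh"  # could point at a site-patched client

    def submit_command(self, working_dir, script_name):
        """Return the argument list that would submit a staged job script."""
        submit = SUBMIT_COMMANDS[self.scheduler]
        remote = f"cd {shlex.quote(working_dir)} && {submit} {shlex.quote(script_name)}"
        return [self.ssh_executable, f"{self.user}@{self.host}", remote]

queue = RemoteQueue(host="cluster.example.org", user="alice", scheduler="SLURM")
command = queue.submit_command("/tmp/MoleQueue/17", "job.sh")
```

Supporting another scheduler in this sketch only requires a new entry in the command table,
while switching to a different SSH client only changes `ssh_executable`.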

Support for SSH with major batch scheduling systems enables MoleQueue to support a
large range of supercomputers and cloud resources (including Amazon’s EC2 when used with
StarCluster to deploy a Sun Grid Engine cluster on demand). The MoleQueue backends are
abstracted in such a way as to allow many compute resource backends to be added. In order
to take advantage of the HPC resources provided to ERDC researchers, integration with the
ezHPC platform was also necessary. This has been accomplished using two approaches, the
first being the more generic SSH transport coupled with PBS, and the second being the use
of the UIT SOAP-like API provided for automated use of HPC resources over an HTTPS
transport.

Unfortunately, the UIT documentation does not provide enough detail in some instances,
and is incorrect in a few places. Only parts of the HPC access have been modeled in
SOAP, and so it is necessary to manually parse many of the responses and match them
to the raw PBS response and error codes. Kitware developers have spent significant time
using a combination of traditional SOAP implementations, following documentation,
trial-and-error, and direct communication with ezHPC UIT support staff in order to provide a
working implementation. We have reported several issues/bugs to the UIT support address,
and will continue to work with them to resolve these issues. At this stage, the vast majority
of the core functionality is implemented, but there are error conditions and event sequences
where the API tokens and data flow remain unclear. Due to the MoleQueue abstraction, it
is possible to use a UIT queue or a direct SSH/PBS queue to dispatch, monitor, and retrieve
results.

2.9  VTK 

The Visualization Toolkit (VTK) is a large, open source, cross-platform toolkit for data
processing and visualization. It has many specialized classes for data processing, informatics,
mathematics, data handling, computational fluid dynamics, geospatial visualization, medical
imaging, charting, volume rendering, and other areas. One area that had not been addressed
in VTK until now is support for chemical data structures and visualization. Many projects
have used VTK for molecular rendering and visualization, but have had to extend it in their
own applications and have not been able to benefit from built-in support.

VTK has been extended with a dedicated chemistry module that provides hardware-accelerated
visualization, making use of advanced support for glyphs in order to get maximum
performance. Support for standard atom color schemes and the standard molecular
representations has been added. The readers have been augmented to read in secondary
protein structure and use the ribbon rendering representations expected for larger biological
structures. Additional file format support has been added, along with optimizations for
larger structures and interaction.

This means that applications using VTK can benefit from built-in support for chemical
structure visualization, along with all the other visualization techniques and data processing
code present in the library. The 2D visualization techniques have also been extended in
order to better support applications in chemistry, such as custom tooltip support (enabling
2D structures to be displayed in tooltips) and support for multidimensional visualization and
selection. Various additional chart types and support for seamless 2D-to-3D chart transitions
offer more immersive visualization and analysis environments in Open Chemistry
applications. The client-server applications using VTK, such as ParaView and ParaViewWeb,
can also benefit from this new functionality and be leveraged in chemical applications.


3  Conclusions 

The project has achieved all core goals, and has prompted several new collaborations that are
resulting in wider impacts in the chemistry community. The three Open Chemistry
applications (MongoChem, MoleQueue, and Avogadro 2) are available in both source and binary
form for all major platforms. The JSON-RPC 2.0 inter-process communication APIs and
common data structures have been developed to facilitate seamless communication between
the loosely coupled components.

The Avogadro 2 libraries and application have been demonstrated on systems ranging from
small molecules involving expensive quantum calculations through to large molecular dynamics
simulations containing over 2.8 million atoms. The MongoChem application saw similar
success, being useful for analyzing small molecule collections through to some of the largest
available, such as the Harvard Clean Energy Project with millions of unique molecules and
hundreds of millions of quantum calculations.

The MoleQueue project gained some additional funding from an ERDC PETTT project
for use in another application domain, and several National Labs have expressed an interest
in integrating MoleQueue in their applications, including a climate modeling project demo.
This interest has led to successful integration with several supercomputers and schedulers (in
addition to UIT/ezHPC, Sun Grid Engine, and PBS as part of this project). Collaborations
with EMSL’s NWChem project and Harvard’s Clean Energy Project have provided multiple
viewpoints on current opportunities for powerful desktop applications in preparing input,
integrating with HPC resources, and applying cheminformatics techniques to the indexing and
analysis of large numbers of quantum calculations/small molecules. Collaborations with the
Peter Tieleman group, and discussions with researchers based at Sandia and Los Alamos
National Laboratories, have offered a view of work taking place in large molecular dynamics
simulations containing millions of atoms.

The projects have gained a significant feature set, and offer unique capabilities. Some
initial funding has been obtained to explore multiscale modeling approaches, using the results
of this project as one of the major foundations. Other nascent collaborations are exploring
areas as diverse as heavy element compound calculations, large-scale materials simulations,
biological systems, and drug delivery systems. We remain committed to the approach taken,
and will continue to develop the project. All projects were successfully migrated to Qt 5
recently, ensuring their viability in the coming years. The software libraries, applications,
and patterns developed position Kitware well to become a major force in this and related
areas.